A community response approach to mental health and substance abuse crises reduced crime

Police officers often serve as first responders to mental health and substance abuse crises. Concerns over the unintended consequences and high costs associated with this approach have motivated emergency response models that augment or completely remove police involvement. However, there is little causal evidence evaluating these programs. This preregistered study presents quasi-experimental evidence on the impact of an innovative “community response” pilot in Denver that directed targeted emergency calls to health care responders instead of the police. We find robust evidence that the program reduced reports of targeted, less serious crimes (e.g., trespassing, public disorder, and resisting arrest) by 34% and had no detectable effect on more serious crimes. The sharp reduction in targeted crimes reflects the fact that health-focused first responders are less likely to report individuals they serve as criminal offenders and the spillover benefits of the program (e.g., reducing crime during hours when the program was not in operation).


Treatment Heterogeneity and Evidence of Robustness
The pre-registered "static" DD specification represented in equation [1] (see main text) assumes that the treatment effect is constant over time. However, the effects of the STAR program could instead have dynamic features. To test for time-varying treatment effects, we also employ a semi-dynamic DD model that unrestrictively allows for treatment effects unique to the month in which a precinct first participates and to each of the following five months:
$$Y_{pm} = \alpha_p + \gamma_m + \sum_{n=0}^{5} \beta_{-n} S_{p,m-n} + \varepsilon_{pm} \qquad [2]$$
Here $Y_{pm}$ is the outcome for precinct p in month m, and $\alpha_p$ and $\gamma_m$ are precinct and month fixed effects. In this model, the six coefficients of interest are represented by $\beta_{-n}$, which identify the effects of STAR in the first month of the program (i.e., $S_{p,m-0}$) as well as the current effect of having begun one month earlier (i.e., $S_{p,m-1}$), two months earlier (i.e., $S_{p,m-2}$), and so on. We then test the equivalence of these coefficients of interest using the null hypothesis of a constant treatment effect:
$$H_0: \beta_{0} = \beta_{-1} = \beta_{-2} = \beta_{-3} = \beta_{-4} = \beta_{-5}$$
We report the semi-dynamic results, both for DD and DDD specifications, in Table S4. Hypothesis tests consistently fail to reject the null hypothesis of a common treatment effect across the 6-month pilot period.
The main text underscores two types of evidence consistent with the internal validity of the pre-registered DD results. One is the absence of a meaningful impact on offenses rated as unrelated to STAR prior to the analysis (i.e., column 3 in Table S4). Table S5 presents the results of a second and important type of evidence. A central and maintained identifying assumption of our preregistered DD approach is that the month-to-month outcome changes among comparison precincts (i.e., those without a change in treatment status) provide a valid counterfactual for what would have changed for treatment precincts in the absence of treatment. This "parallel trends" assumption is fundamentally untestable. However, we can provide empirical evidence on the validity of this important assumption through unrestrictive "event study" specifications that allow us to examine whether treatment and comparison group precincts had similar month-to-month changes in outcomes prior to the onset of treatment. To the extent that this hypothesis is true, it is consistent with the parallel-trends assumption. We examine this question through event-study specifications of the following form:
$$Y_{pm} = \alpha_p + \gamma_m + \sum_{n=1}^{5} \beta_{+n} S_{p,m+n} + \sum_{n=0}^{5} \beta_{-n} S_{p,m-n} + \varepsilon_{pm} \qquad [3]$$
This event-study specification effectively extends the semi-dynamic specification (equation [2]) to allow for fixed effects unique to each month prior to participating in STAR (i.e., "leads" of treatment adoption). That means the coefficients of interest are represented as $\beta_{+n}$ and $\beta_{-n}$, which designate the "effect" for precinct p in month m of participation in STAR n months in the future or n months in the past, respectively. The reference category includes those never participating in STAR and those six months prior to their first participation in STAR.
To examine the assumption of parallel trends, we test whether, prior to their participation in STAR, treatment precincts have month-to-month changes in outcomes distinct from comparison precincts:
$$H_0: \beta_{+1} = \beta_{+2} = \beta_{+3} = \beta_{+4} = \beta_{+5} = 0$$
We report the event-study results, both for STAR-related and unrelated offenses, in Table S5 and Figure 3 in the main text. The results are consistent with the parallel-trends assumption, indicating that we cannot reject the null hypothesis that the treated precincts had month-to-month changes similar to the comparison precincts in the months prior to the program activity.
Table S6 presents the key results from a variety of alternative specifications that probe the robustness and heterogeneity of the confirmatory finding. First, we consider alternative approaches to conducting inference in this application. Our main estimates rely on heteroscedasticity-consistent standard errors that allow for precinct-specific clustering in the error term associated with criminal offenses. However, because there are only 36 unique precincts, this clustering approach may be subject to finite-sample biases. To examine this concern, we report the results based on the procedure recently introduced by Pustejovsky and Tipton (51). The results are quite similar to our reported findings.
As a further and unrestrictive check on our main inference, we also conducted randomization inference with respect to the confirmatory finding. Specifically, over 100,000 replications, we randomly assigned treatment status within precincts and estimated the "impact" of the STAR program. Randomization inference has a particular appeal in applications like this because the data may be better understood as having "design-based" variation in what units are treated rather than having variation due to being drawn from a larger hypothesized population. Figure S4 shows the histogram of estimated effects based on this permutation procedure. Because treatment status was assigned randomly, this distribution can be understood as the distribution of treatment effects when the null hypothesis of no effect is true. Over the 100,000 replications, none of the estimates in this distribution was as large in absolute value as the estimate based on the actual data (i.e., -0.41). This implies a randomization-inference p-value that is less than 0.00001.
Table S6 also presents results based on alternative estimation procedures and constructions of the analytical sample. Specifically, Table S6 presents the conditional maximum likelihood (CML) estimates of Poisson and negative binomial specifications that explicitly recognize both the count nature of the offense data and the presence of fixed effects (28). The resulting estimates are quite similar to those based on the pre-registered DD specification. Table S6 also presents the main DD results when dropping data from a STAR-participating police precinct (i.e., precinct 311) where program activities were targeted to a main corridor rather than intended to be active precinct-wide. Though there is no clear reason to expect biases from the onset of the COVID-19 pandemic, especially conditional on month fixed effects, Table S6 also shows the results of using data only from March 2020 (i.e., the onset of the shutdown) onward. Both data edits result in DD estimates consistent with our main finding.
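The randomization-inference check described above can be illustrated with a minimal sketch. It is not the authors' code: it runs on a synthetic panel, assumes that all treated precincts begin in the same month (so that the DD estimate reduces to a simple 2x2 comparison), and permutes treatment labels across precincts. The function names, the number of treated precincts (8), the noise level, and the 2,000 replications are all hypothetical choices made for illustration.

```python
import random


def dd_estimate(y, treated, first_post):
    """2x2 DD: mean post-minus-pre change of treated minus comparison precincts.

    y[p] holds the monthly log offense counts for precinct p;
    first_post is the index of the first treated month.
    """
    def mean(values):
        return sum(values) / len(values)

    changes = [mean(row[first_post:]) - mean(row[:first_post]) for row in y]
    treated_changes = [c for c, d in zip(changes, treated) if d]
    comparison_changes = [c for c, d in zip(changes, treated) if not d]
    return mean(treated_changes) - mean(comparison_changes)


def randomization_inference(y, treated, first_post, n_reps=2000, seed=0):
    """Share of permuted estimates at least as large (in absolute value) as the observed one."""
    rng = random.Random(seed)
    observed = dd_estimate(y, treated, first_post)
    labels = list(treated)
    extreme = 0
    for _ in range(n_reps):
        rng.shuffle(labels)  # reassign treatment labels across precincts
        if abs(dd_estimate(y, labels, first_post)) >= abs(observed):
            extreme += 1
    return observed, extreme / n_reps


# Synthetic panel (hypothetical values): 36 precincts, 12 months, true effect of -0.4.
data_rng = random.Random(1)
n_precincts, n_months, first_post = 36, 12, 6
treated = [p < 8 for p in range(n_precincts)]
y = [
    [3.0 + data_rng.gauss(0, 0.1) + (-0.4 if treated[p] and m >= first_post else 0.0)
     for m in range(n_months)]
    for p in range(n_precincts)
]

observed, p_value = randomization_inference(y, treated, first_post)
print(f"observed estimate: {observed:.3f}, randomization p-value: {p_value:.4f}")
```

When no permuted estimate is as extreme as the observed one, the computed share is zero, and the result is conventionally reported as a p-value below one over the number of replications, which is how the bound of 0.00001 follows from 100,000 replications.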
Finally, Table S6 also presents the results of exploring two particular forms of treatment heterogeneity. First, we explored the possibility that the STAR program also led to crime reductions in geographically adjacent precincts. Specifically, we created an additional treatment indicator equal to one only for precincts that were adjacent to STAR precincts when the STAR program was active. The estimated effect reported in Table S6 indicates that we cannot reject the null hypothesis of no effect in neighboring precincts. We also explore possibly heterogeneous treatment effects across days of the week and times of the day when the STAR program was active. As the main text notes, the program was only active Monday through Friday, 10 AM to 6 PM. We created separate counts for STAR-related offenses that occurred within and outside these weekly windows. The results in Table S6 indicate that the STAR program led to similar reductions in targeted offenses across both time periods. As noted in the main text, this finding is consistent with the hypothesis that the program brought into the health-care system individuals in crisis who would otherwise commit police-reported offenses at other times of the week (i.e., evenings and weekends) as well.
Table S7 reports results of DD estimates using generalized synthetic control (45) and comparative interrupted time series (CITS) designs (47), both of which are consistent with our main confirmatory findings.
Table S8 reports DD estimates of the impact of the STAR program on overall offenses and on offenses across the broad and mutually exclusive categories defined in the City's NIBRS-based data. The point estimates indicate that the STAR program reduced the natural log of total offenses by a statistically significant 0.15, which implies the 14 percent reduction noted in the main text [i.e., $(e^{-0.15} - 1) \times 100$]. The estimates by category indicate that these reductions were plausibly concentrated in offenses such as "alcohol and drugs" (i.e., -0.53), "disorderly conduct" (i.e., -0.20), and "other crimes against people" (i.e., -0.14).
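Because the outcomes are natural logs of offense counts, a DD coefficient translates into a percent change through the expression (e^b - 1) x 100. The short sketch below applies that conversion to two coefficients quoted in this supplement (-0.15 for total offenses and -0.41 for the confirmatory estimate); the function name is hypothetical.

```python
import math


def log_points_to_percent(beta):
    """Convert a coefficient on a logged outcome into a percent change."""
    return (math.exp(beta) - 1.0) * 100.0


for beta in (-0.15, -0.41):
    print(f"beta = {beta:+.2f} -> {log_points_to_percent(beta):.1f} percent")
```

Rounded to whole numbers, these conversions give reductions of 14 percent and 34 percent, matching the figures quoted in the text.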

Pre-Registration Plan
The following is our detailed pre-registration plan, filed on February 14, 2021, prior to any data analysis related to the study.
A. Study Information

Hypotheses
Precincts participating in the STAR program will have reduced prevalence of criminal offenses related to mental health, poverty, homelessness, and substance abuse in the City of Denver.

Study type
Observational Study - Data is collected from study subjects that are not randomly assigned to a treatment. This includes surveys, "natural experiments," and regression discontinuity designs.

Blinding
No blinding is involved in this study.

Study design
We focus on recorded offenses in each city precinct in a given month from December 1, 2019 through November 30, 2020. This time period represents the six-month pilot phase of STAR (June 2020-November 2020) and the six months prior to the pilot beginning. This design strategy allows us to take advantage of our panel dataset in months surrounding implementation. Our analytical sample consists of 36 precincts and 432 precinct-month observations, from December 2019 through November 2020.

Existing Data
Registration prior to analysis of the data

Explanation of existing data
The data come from open access police records provided by the city of Denver, CO (https://www.denvergov.org/opendata/dataset/city-and-county-of-denver-crime). These data include criminal incident records from January 2, 2016 through January 15, 2021 involving adults. Due to legal restrictions, these data do not report crimes that by nature involve juveniles as victims (e.g., child abuse offenses), suspects or witnesses. These data also exclude "unfounded" incidents, which authorities have determined did not actually occur after they are reported.

Data collection procedures
We downloaded open access police records provided by the city of Denver, CO (https://www.denvergov.org/opendata/dataset/city-and-county-of-denver-crime). We retain recorded offenses in each city precinct in a given month from December 1, 2019 through November 30, 2020. This time period represents the six-month pilot phase of STAR (June 2020-November 2020) and the six months prior to the pilot beginning.

Sample Size
Our analytical sample consists of 36 precincts and 432 precinct-month observations, from December 2019 through November 2020.

Sample size rationale
This sampling allows for observation of criminal offenses in the city six months before and six months after the beginning of the STAR program, which allows for ample observation of pre and post treatment outcomes, tests of critical model assumptions, and for dynamic effects of the program.

Measured variables
Our outcome of interest is semi-logged precinct-month counts of STAR-related types of criminal offenses. The City of Denver codes recorded criminal offenses into fifteen overarching categories, including aggravated assault, arson, auto theft, burglary, drug and alcohol offenses, larceny, murder, public disorder, robbery, sexual assault, theft from motor vehicles, traffic accidents, white collar crimes, other crimes against individuals, and all other crimes. These categories give some sense of the types of crimes that might be related to the STAR program's aims but continue to carry a substantial amount of noise for our treatment estimates. From those 15 categories, offenses are differentiated by 199 different types. We coupled information on offense type descriptions and data on their frequencies with independent rater coding to identify those offenses that are STAR-relevant and those that are not. We measure treatment status by capturing each precinct's monthly participation in the STAR program. Using these data, we construct a simple binary indicator equal to 1 for precinct-month observations from precincts that participate in STAR during a given month (i.e., a "static" measure of treatment). We also use the timing of STAR participation to define less restrictive and flexibly dynamic measures of program participation. These include binary indicators for the month that the program began (June 2020) and separate indicators for being one through five months after that first participation month. These measures flexibly allow for the initial participation in STAR to have effects that increase or decline over time.
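As an illustration of the static and dynamic treatment measures described above, the following sketch builds both for a single precinct-month. It is a minimal example under stated assumptions, not the authors' code: months are indexed by their position in the study window (0 = December 2019, 6 = June 2020, 11 = November 2020), and the function and argument names are hypothetical.

```python
def treatment_indicators(participates, month, start_month=6, max_lag=5):
    """Return (static, dynamic) treatment measures for one precinct-month.

    participates: True if the precinct takes part in STAR.
    month, start_month: 0-based positions in the study window
        (assumed: 0 = December 2019, 6 = June 2020, 11 = November 2020).
    """
    months_since_start = month - start_month
    # Static measure: 1 for every month in which the precinct participates.
    static = int(participates and months_since_start >= 0)
    # Dynamic measures: one indicator per month elapsed since the first month.
    dynamic = {
        lag: int(participates and months_since_start == lag)
        for lag in range(max_lag + 1)
    }
    return static, dynamic


# A participating precinct observed in August 2020 (two months after the start).
static, dynamic = treatment_indicators(True, 8)
print(static, dynamic)
```

At most one dynamic indicator is switched on per precinct-month, so the static measure equals the sum of the dynamic indicators within the pilot window.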

Indices
From the City of Denver's 15 overarching criminal offense categories, we differentiate those offenses that are most related to the types of calls that the STAR team will respond to from other types of offenses that are unlikely either to substitute for a non-criminal STAR team visit or to result from an escalation of such non-criminal incidents. From those 15 offense categories, offenses are differentiated by 199 different types. We coupled information on offense type descriptions and data on their frequencies with independent rater coding to identify those offenses that are STAR-relevant and those that are not.

Statistical models
Our main confirmatory analysis is based on a difference-in-differences (DD) design, which assumes that STAR activity in a given precinct and month leads to a constant, one-time change in STAR-related criminal offenses for participating precincts. We do so by comparing changes in these outcomes among precincts participating in STAR to outcomes of precincts that either never participated or had yet to participate in STAR. The outcome will be a semi-logged count of STAR-related criminal offenses. The predictors will be (1) an indicator of a treated precinct in a treated month, (2) precinct fixed effects, and (3) month fixed effects. Standard errors will be clustered at the precinct level.
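The static DD point estimate described above can be sketched without any regression library. On a balanced panel with a single treatment indicator, regressing the outcome on the indicator plus precinct and month fixed effects is equivalent to demeaning both variables by precinct and by month and taking the ratio of their cross-product to the indicator's sum of squares. The sketch below is an illustration under that assumption, run on synthetic data with hypothetical names and values; it returns only the point estimate and does not compute the precinct-clustered standard errors.

```python
def two_way_demean(x):
    """Subtract precinct and month means (adding back the grand mean) on a balanced panel."""
    n_p, n_m = len(x), len(x[0])
    row_means = [sum(row) / n_m for row in x]
    col_means = [sum(x[p][m] for p in range(n_p)) / n_p for m in range(n_m)]
    grand_mean = sum(row_means) / n_p
    return [
        [x[p][m] - row_means[p] - col_means[m] + grand_mean for m in range(n_m)]
        for p in range(n_p)
    ]


def static_dd(y, s):
    """Coefficient on s in a regression of y on s, precinct effects, and month effects."""
    y_t, s_t = two_way_demean(y), two_way_demean(s)
    numerator = sum(a * b for ry, rs in zip(y_t, s_t) for a, b in zip(ry, rs))
    denominator = sum(b * b for rs in s_t for b in rs)
    return numerator / denominator


# Synthetic balanced panel (hypothetical values): 36 precincts, 12 months,
# 8 treated precincts starting in month 6, true effect of -0.41 and no noise.
n_precincts, n_months, start, true_beta = 36, 12, 6, -0.41
s = [[1.0 if p < 8 and m >= start else 0.0 for m in range(n_months)]
     for p in range(n_precincts)]
y = [[3.0 + 0.1 * p + 0.05 * m + true_beta * s[p][m] for m in range(n_months)]
     for p in range(n_precincts)]

estimate = static_dd(y, s)
print(f"static DD estimate: {estimate:.4f}")
```

Because the synthetic outcome is built without noise, the estimate recovers the assumed effect up to floating-point error, which serves as a quick check on the demeaning logic.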

Transformations
We use precinct-month counts of STAR-related criminal offenses in the panel data to estimate the effects of the STAR program on the number of offenses committed in each precinct. We transform the outcome variable by taking its natural log, so that the outcome is the semi-logged count of STAR-related criminal offenses for precinct p in month m.

Inference criteria
We will make inferences of our confirmatory analysis using two-tailed tests and p-values of p<.10. We will report p-values differently based on thresholds of p<.01, p<.05, and p<.10.

Data exclusion
We exclude data prior to December 2019 and after November 2020.

Missing data
There are no instances of missing precinct-month data, including criminal offenses recorded in a given precinct-month. Thus, we observe no missing data for our confirmatory analyses. However, in some exploratory analyses we examine program effects at the precinct-week level. For instances in which there are no STAR-related offenses in a given week, we will replace the missing value with the natural log of 0.5. We do the same for STAR-unrelated offenses in exploratory analyses.
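The zero-count rule for the exploratory precinct-week analyses can be written as a small helper. This is a sketch with a hypothetical function name: positive counts are logged as usual, and a count of zero, whose log is undefined, is replaced with the natural log of 0.5 as stated above.

```python
import math


def semi_log_count(count):
    """Natural log of an offense count, using ln(0.5) when the count is zero."""
    if count < 0:
        raise ValueError("offense counts cannot be negative")
    return math.log(count) if count > 0 else math.log(0.5)


for count in (0, 1, 12):
    print(count, round(semi_log_count(count), 3))
```

The substitution keeps zero-offense weeks in the sample while placing them below weeks with a single offense, whose logged value is zero.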

Exploratory analysis
We will conduct a number of exploratory analyses. First, to test for time-varying treatment effects, we employ a semi-dynamic DD model that unrestrictively allows for treatment effects unique to the month immediately after a precinct first participates and up to seven months later. We then test the equivalence of these coefficients of interest using the null hypothesis of a constant treatment effect. Second, we will conduct an "event study" analysis. A crucial maintained assumption of our DD approach is that the month-to-month outcome changes among "control" precincts (i.e., those without a change in treatment status) provide a valid counterfactual for what would have changed for treatment precincts in the absence of treatment. This "parallel trends" assumption is fundamentally untestable. However, we can provide qualified evidence on the validity of this important assumption through unrestrictive "event study" specifications that allow us to examine whether treatment and control group precincts had similar month-to-month changes in outcomes prior to the onset of treatment. To the extent that this hypothesis is true, it is consistent with the parallel trends assumption. We examine this question through event-study specifications. Third, because these data also include counts of criminal offenses that are unrelated to the STAR program's goals, there is an opportunity to test a "triple diff" (DDD) research design that allows us to account for unobserved disturbances in precinct-month observations. Stacking our data at the precinct-month-(STAR & non-STAR) offense level, the DDD specification includes fixed effects for all two-way interactions. Fourth, we will rerun the static DD model for each of the 15 criminal offense category outcomes reported in the original dataset. Fifth, to test for potential differential effects of the COVID pandemic on criminal offenses, we rerun the confirmatory analysis but only include offenses from March 2020 through November 2020. Sixth, we rerun the confirmatory analysis using a count outcome in a negative binomial precinct fixed effects model. Seventh, we analyze the confirmatory outcome during STAR-eligible and STAR-ineligible times. Eighth, another model tests for spillover effects of the STAR program in precincts adjacent to the participating precincts. Ninth, we examine static and semi-dynamic program effects at the precinct-week level, for STAR-related and STAR-unrelated criminal offenses.

Deviations from Pre-Registration Plan
Our main results do not deviate from the pre-registration plan. However, we have added several additional exploratory analyses not reported in the original pre-registration plan. First, in a robustness check we recode "simple assaults", "simple assaults on police officers", and "disarming a peace officer" as STAR-unrelated offenses (see Table S6). Second, we include a static DD model in which we recode May 2020 as a "treatment" month among STAR-active precincts, to test for anticipation of the program's start (see Table S6). Third, in Table S5 we include an additional pre-trends F-test for only months during COVID restrictions (i.e., March 2020-May 2020). Fourth, we conduct placebo static DD and event study tests in years prior to our study window (see Table S6 and Figures S5-S7). Finally, we conduct additional robustness checks using a generalized synthetic control (GSC) design and a comparative interrupted time series (CITS) approach (see Table S7).

… Figure 3 and Table S5, but applied to a time period other than our study window. Specifically, the difference-in-differences (DD) estimates are based on 432 precinct-month observations from December 2016 through November 2017 and condition on precinct fixed effects and month fixed effects. The outcome variables are the natural log of offense counts, differentiated by those that are STAR-related and those that are not. The horizontal line at zero denotes the baseline levels of offenses.

… Figure 3 and Table S5, but applied to a time period other than our study window. Specifically, the difference-in-differences (DD) estimates are based on 432 precinct-month observations from December 2017 through November 2018 and condition on precinct fixed effects and month fixed effects. The outcome variables are the natural log of offense counts, differentiated by those that are STAR-related and those that are not. The horizontal line at zero denotes the baseline levels of offenses.

… Figure 3 and Table S5, but applied to a time period other than our study window. Specifically, the difference-in-differences (DD) estimates are based on 432 precinct-month observations from December 2018 through November 2019 and condition on precinct fixed effects and month fixed effects. The outcome variables are the natural log of offense counts, differentiated by those that are STAR-related and those that are not. The horizontal line at zero denotes the baseline levels of offenses.

Table S7. Comparative interrupted time series estimates. The dependent variable is the natural log of the stated offenses (n = 432 precinct-month observations). The first two columns report estimates based on generalized synthetic control (GSC; 45) and bootstrapped standard errors (1,000 replications). The next two columns report estimates based on a comparative interrupted time-series (CITS) specification (46) and standard errors clustered at the precinct level. The CITS specifications also condition on precinct fixed effects and month fixed effects. *** p<0.01, ** p<0.05, * p<0.10.

Table S8. Estimated Effects by Offense Category. The difference-in-differences (DD) estimates are based on 432 precinct-month observations and condition on precinct fixed effects and month fixed effects. Due to the very low instances of arson, murder, robbery, sexual assault, and white-collar offenses, we do not include those STAR-unrelated offense categories. The outcome variables are the natural log of the offense counts. Standard errors, clustered at the precinct level, are in parentheses. *** p<0.01, ** p<0.05, * p<0.10.